The FM-Index: A Compressed Full-Text Index Based on the BWT

Authors

  • Paolo Ferragina
  • Giovanni Manzini

Abstract

In this talk we address the problem of indexing compressed data from both the theoretical and the practical point of view. We start by introducing the FM-index data structure [2], which supports substring searches and occupies space that is a function of the entropy of the indexed data. The key feature of the FM-index is that it encapsulates the indexed data (it is a self-index) and achieves its space reduction with no significant slowdown in query performance. Precisely, given a text T[1, n] to be indexed, the FM-index occupies at most 5nH_k(T) + o(n) bits of storage, where H_k(T) is the k-th order entropy of T, and allows searching for the occ occurrences of a pattern P[1, p] within T in O(p + occ log^ε n) time, where ε > 0 is an arbitrary constant fixed in advance. The design of the FM-index is based upon the relationship between the Burrows-Wheeler compression algorithm [1] and the suffix array data structure [9]. It is therefore a sort of compressed suffix array that takes advantage of the compressibility of the indexed data in order to achieve space occupancy close to the information-theoretic minimum. Indeed, the design of the FM-index does not depend on the parameter k, and its space bound holds simultaneously over all k ≥ 0. These remarkable theoretical properties have been validated by experimental results [3, 4] and applications [7, 10]. In particular, it has been shown that the FM-index achieves a space occupancy close to that of the best known compressors and, unlike them, supports searching for arbitrary substrings in hundreds of megabytes within a few milliseconds, since it does not decompress the whole file. We will conclude the talk by sketching two intriguing variants of the FM-index. One achieves O(p + occ) query time (i.e. output sensitivity) and uses O(nH_k(T) log^ε n) + o(n) bits of storage. This data structure exploits the interplay between two compressors: the Burrows-Wheeler algorithm and the LZ78 algorithm [11]. Our other proposal [8] combines two recent and elegant techniques, compression boosting [5] and the wavelet tree [6], to design a variant of the FM-index that scales well with the size of the input alphabet.
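
The relationship between the Burrows-Wheeler transform and the suffix array mentioned in the abstract can be illustrated with a small, uncompressed sketch of backward search. This is not the authors' implementation: the function names (build_fm_index, backward_search, locate) are hypothetical, the suffix array is built naively and stored in full, and the occurrence table is a plain array rather than a compressed structure, so the sketch shows only the search logic and none of the space bounds.

```python
def build_fm_index(text, terminator="$"):
    """Toy, uncompressed index: returns (suffix array, BWT, C table, occ table)."""
    # Assumption: the terminator sorts before every character of the text.
    assert all(terminator < ch for ch in text)
    t = text + terminator
    # Naive suffix sorting; fine for a small illustration only.
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    # BWT: the character preceding each sorted suffix (wraps around for i == 0).
    bwt = "".join(t[i - 1] for i in sa)
    counts = {}
    for ch in t:
        counts[ch] = counts.get(ch, 0) + 1
    # C[c] = number of characters in t that are strictly smaller than c.
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, bwt, C, occ


def backward_search(pattern, C, occ, n):
    """Return the half-open range [lo, hi) of suffix array rows prefixed by pattern."""
    lo, hi = 0, n
    # Process the pattern right to left, one step per character.
    for ch in reversed(pattern):
        if ch not in C:
            return 0, 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def locate(pattern, sa, C, occ):
    """Report the text positions of all occurrences, in increasing order."""
    lo, hi = backward_search(pattern, C, occ, len(sa))
    return sorted(sa[i] for i in range(lo, hi))


sa, bwt, C, occ = build_fm_index("abracadabra")
print(bwt)
print(locate("abra", sa, C, occ))
```

The loop in backward_search performs one step per pattern character, which corresponds to the O(p) term of the query bound. In this sketch each occurrence is read directly from the stored suffix array, whereas a compressed index would typically keep only a sample of suffix array entries and recover the remaining positions at extra cost per occurrence.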

Similar resources

A bloated FM-index reducing the number of cache misses during the search

The FM-index is a well-known compressed full-text index, based on the Burrows–Wheeler transform (BWT). During a pattern search, the BWT sequence is accessed at “random” locations, which is cache-unfriendly. In this paper, we are interested in speeding up the FM-index by working on q-grams rather than individual characters, at the cost of using more space. The first presented variant is related t...

Entropy-Compressed Indexes for Multidimensional Pattern Matching

In this talk, we will discuss the challenges involved in developing multidimensional generalizations of compressed text indexing structures. These structures depend on some notion of Burrows-Wheeler transform (BWT) for multiple dimensions, though naive generalizations do not enable multidimensional pattern matching. We study the 2D case to possibly highlight combinatorial properties that do n...

Implementation Structure

This document describes the structure of the classes chosen to implement the FM-Index by Ferragina and Manzini in SeqAn. In addition, the steps necessary to implement the index will be shown and described. This document is based on [3]. A very short description of the index can be found here: FM-Index wiki. This document will describe several different implementations of the FM-Index. Even though ...

Run-Length FM-index

The FM-index is a succinct text index needing only O(Hkn) bits of space, where n is the text size and Hk is the kth order entropy of the text. The FM-index assumes a constant alphabet; it uses exponential space in the alphabet size, σ. In this paper we show how the same ideas can be used to obtain an index needing O(Hkn) bits of space, with the constant factor depending only logarithmically on σ. Our...

On Entropy-Compressed Text Indexing in External Memory

A new trend in the field of pattern matching is to design indexing data structures which take space very close to that required by the indexed text (in entropy-compressed form) and also simultaneously achieve good query performance. Two popular indexes, namely the FM-index [Ferragina and Manzini, 2005] and the CSA [Grossi and Vitter 2005], achieve this goal by exploiting the Burrows-Wheeler tra...

Publication date: 2004